Bayesian Information Criterion

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7.

The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, where he gave a Bayesian argument for adopting it.


Definition

The BIC is formally defined as

: \mathrm{BIC} = k\ln(n) - 2\ln(\widehat L),

where
* \widehat L = the maximized value of the likelihood function of the model M, i.e. \widehat L = p(x\mid\widehat\theta, M), where \widehat\theta are the parameter values that maximize the likelihood function;
* x = the observed data;
* n = the number of data points in x, the number of observations, or equivalently, the sample size;
* k = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the q slope parameters, and the constant variance of the errors; thus, k = q + 2.
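For concreteness, here is a minimal sketch in Python of computing the BIC of a multiple linear regression under the Gaussian-error assumption; the synthetic data and the helper name gaussian_bic are illustrative assumptions, not part of the original definition.

import numpy as np

def gaussian_bic(y, X):
    """BIC = k*ln(n) - 2*ln(L_hat) for a linear model with i.i.d. Gaussian errors."""
    n, q = X.shape                            # n observations, q slope predictors
    X1 = np.column_stack([np.ones(n), X])     # add an intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta_hat
    sigma2_hat = resid @ resid / n            # maximum-likelihood (biased) error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)   # maximized log-likelihood
    k = q + 2                                 # intercept + q slopes + error variance
    return k * np.log(n) - 2 * log_lik

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)
print(gaussian_bic(y, X))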


Derivation

Konishi and Kitagawa derive the BIC to approximate the distribution of the data, integrating out the parameters using Laplace's method, starting with the following model evidence:

: p(x\mid M) = \int p(x\mid\theta, M)\, \pi(\theta\mid M) \, d\theta,

where \pi(\theta\mid M) is the prior for \theta under model M.

The log-likelihood, \ln(p(x\mid\theta, M)), is then expanded to a second-order Taylor series about the MLE, \widehat\theta, assuming it is twice differentiable, as follows:

: \ln(p(x\mid\theta, M)) = \ln(\widehat L) - \frac{n}{2} (\theta - \widehat\theta)^{\mathrm{T}} \mathcal{I}(\theta) (\theta - \widehat\theta) + R(x, \theta),

where \mathcal{I}(\theta) is the average observed information per observation, and R(x, \theta) denotes the residual term. To the extent that R(x, \theta) is negligible and \pi(\theta\mid M) is relatively linear near \widehat\theta, we can integrate out \theta to get the following:

: p(x\mid M) \approx \widehat L \left(\frac{2\pi}{n}\right)^{\frac{k}{2}} |\mathcal{I}(\widehat\theta)|^{-\frac{1}{2}} \pi(\widehat\theta).

As n increases, we can ignore |\mathcal{I}(\widehat\theta)| and \pi(\widehat\theta) as they are O(1). Thus,

: p(x\mid M) = \exp\left(\ln\widehat L - \frac{k}{2} \ln(n) + O(1)\right) = \exp\left(-\frac{\mathrm{BIC}}{2} + O(1)\right),

where BIC is defined as above, and \widehat L either (a) is the Bayesian posterior mode or (b) uses the MLE and the prior \pi(\theta\mid M) has nonzero slope at the MLE. Then the posterior satisfies

: p(M\mid x) \propto p(x\mid M)\, p(M) \approx \exp\left(-\frac{\mathrm{BIC}}{2}\right) p(M).
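The O(1) error in this approximation can be checked numerically. The sketch below uses an assumed toy model, a single unknown mean mu with prior N(0, tau^2) and known noise variance sigma^2, for which the model evidence p(x | M) has a closed form, and verifies that ln p(x | M) + BIC/2 stays bounded as n grows; all names and parameter values here are illustrative.

import numpy as np

sigma2, tau2 = 1.0, 4.0          # known noise variance and prior variance (assumed values)
rng = np.random.default_rng(1)

for n in (10, 100, 1000, 10000):
    x = rng.normal(loc=0.7, scale=np.sqrt(sigma2), size=n)

    # Exact log evidence: x ~ N(0, sigma2*I + tau2*J), handled via the Sherman-Morrison identity.
    s, sx = x.sum(), (x ** 2).sum()
    logdet = (n - 1) * np.log(sigma2) + np.log(sigma2 + n * tau2)
    quad = sx / sigma2 - (tau2 / (sigma2 * (sigma2 + n * tau2))) * s ** 2
    log_evidence = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

    # BIC with k = 1 (only the mean is estimated; the variance is known).
    mu_hat = x.mean()
    log_lik = -0.5 * (n * np.log(2 * np.pi * sigma2) + ((x - mu_hat) ** 2).sum() / sigma2)
    bic = 1 * np.log(n) - 2 * log_lik

    print(n, log_evidence + bic / 2)   # stays O(1) even though both terms grow with n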


Usage

When picking from several models, ones with lower BIC values are generally preferred. The BIC is an increasing function of the error variance \sigma_e^2 and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. However, a lower BIC does not necessarily indicate one model is better than another. Because it involves approximations, the BIC is merely a heuristic. In particular, differences in BIC should never be treated like transformed Bayes factors.

It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all models being compared. The models being compared need not be nested, unlike the case when models are being compared using an F-test or a likelihood-ratio test.
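As an illustration of this usage, the sketch below fits two candidate regressions (linear and quadratic) to the same response values and prefers the one with the lower BIC; the data-generating process and the function name bic_ols are assumptions made for the example.

import numpy as np

def bic_ols(y, X):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 1                                      # p coefficients + error variance
    return n * np.log(rss / n) + k * np.log(n)     # BIC up to an additive constant in n

rng = np.random.default_rng(2)
t = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * t + rng.normal(scale=0.3, size=200)   # data generated from a linear model

X_lin = np.column_stack([np.ones_like(t), t])
X_quad = np.column_stack([np.ones_like(t), t, t ** 2])
print("linear    BIC:", bic_ols(y, X_lin))
print("quadratic BIC:", bic_ols(y, X_quad))   # usually higher: the extra parameter buys little fit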


Properties

* The BIC generally penalizes free parameters more strongly than the Akaike information criterion, though it depends on the size of n and the relative magnitude of n and k (see the penalty comparison sketched after this list).
* It is independent of the prior.
* It can measure the efficiency of the parameterized model in terms of predicting the data.
* It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
* It is approximately equal to the minimum description length criterion but with negative sign.
* It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset.
* It is closely related to other penalized likelihood criteria such as the deviance information criterion and the Akaike information criterion.
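To make the first property concrete, the short sketch below (an illustration, not from the original text) compares the penalty terms of the two criteria: BIC adds k ln(n) while AIC adds 2k, so BIC's penalty is larger exactly when ln(n) > 2, i.e. for n >= 8, consistent with the "sample sizes greater than 7" statement above.

import numpy as np

k = 3                                    # arbitrary illustrative number of parameters
for n in (4, 8, 20, 100, 1000):
    print(f"n={n:5d}  BIC penalty={k * np.log(n):7.2f}  AIC penalty={2 * k:5.2f}")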


Limitations

The BIC suffers from two main limitations:
# the above approximation is only valid for sample size n much larger than the number k of parameters in the model;
# the BIC cannot handle complex collections of models, as in the variable-selection (or feature-selection) problem in high dimensions.


Gaussian special case

Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution and the boundary condition that the derivative of the log likelihood with respect to the true variance is zero, this becomes (up to an additive constant, which depends only on n and not on the model):

: \mathrm{BIC} = n \ln(\widehat{\sigma_e^2}) + k \ln(n),

where \widehat{\sigma_e^2} is the error variance. The error variance in this case is defined as

: \widehat{\sigma_e^2} = \frac{1}{n} \sum_{i=1}^n (x_i - \widehat{x_i})^2,

which is a biased estimator for the true variance.

In terms of the residual sum of squares (RSS) the BIC is

: \mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k \ln(n).

When testing multiple linear models against a saturated model, the BIC can be rewritten in terms of the deviance \chi^2 as:

: \mathrm{BIC} = \chi^2 + k \ln(n),

where k is the number of model parameters in the test.
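The following sketch (an assumed OLS setup, not from the article) checks numerically that the RSS form above differs from the full definition k ln(n) - 2 ln(L_hat) only by the additive constant n(ln(2*pi) + 1), which depends on n but not on the model being compared.

import numpy as np

rng = np.random.default_rng(3)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.8, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)
sigma2_hat = rss / n                      # maximum-likelihood (biased) error variance
k = X.shape[1] + 1                        # coefficients + error variance

log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
bic_full = k * np.log(n) - 2 * log_lik
bic_rss = n * np.log(rss / n) + k * np.log(n)

print(bic_full - bic_rss)                 # equals n*(ln(2*pi) + 1), independent of the model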


See also

* Akaike information criterion
* Bayes factor
* Bayesian model comparison
* Deviance information criterion
* Hannan–Quinn information criterion
* Jensen–Shannon divergence
* Kullback–Leibler divergence
* Minimum message length




External links


* Information Criteria and Model Selection
* Sparse Vector Autoregressive Modeling